Blocking: An R Package for Blocking of Records for Record Linkage and Deduplication

An abstract of less than 250 words.

Maciej Beręsewicz https://maciejberesewicz.com (University of Economics and Business; Statistical Office in Poznań), Adam Struzik (Adam Mickiewicz University; Statistical Office in Poznań)
2025-11-09

1 Introduction

Interactive data graphics provide plots that users can interact with. One of the most basic types of interaction is the tooltip, which gives users additional information about an element of the plot when they move the cursor over it.

This paper first reviews some R packages for interactive graphics and their tooltip implementations. A new package, ToOoOlTiPs, which provides customized tooltips for plots, is then introduced. Finally, some example plots showcase how these tooltips help users read the graphics.

2 Background

Packages for interactive graphics include plotly (Sievert 2020), which interfaces with JavaScript for web-based interactive graphics, and crosstalk (Cheng and Sievert 2021), which specializes in cross-linking elements across individual graphics. The recent R Journal paper on tsibbletalk (Wang and Cook 2021) provides a good example of including interactive graphics in an article for the journal. It has both a set of linked plots and an animated GIF example illustrating linking between time series plots and feature summaries.

3 Blocking of records using blocking function

4 Integration with existing packages

5 Case study

5.1 Record linkage example

Let us first load the required packages.
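
A minimal sketch of this step, assuming that the packages needed are blocking (for the blocking function) and data.table (for the := and rleid syntax used below). The availability check around the library calls is an addition for illustration only.

```r
# Assumption: blocking and data.table are the packages this section relies on,
# inferred from the functions used in the code that follows.
pkgs <- c("blocking", "data.table")

# Check which of them are installed without attaching anything yet
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(available)) {
  library(blocking)
  library(data.table)
} else {
  message("Install first: ", paste(pkgs[!available], collapse = ", "))
}
```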

We demonstrate the use of the blocking function for record linkage with the foreigners dataset included in the package. This fictional representation of the foreign population in Poland was generated from publicly available information, preserving the distributions found in administrative registers. It contains 110,000 rows describing 100,000 entities. Each row represents one record, with the following columns:

data(foreigners)
head(foreigners)
    fname  sname    surname       date region country true_id
   <char> <char>     <char>     <char> <char>  <char>   <num>
1:   emin            imanov 1998/02/05            031       0
2: nurlan        suleymanli 2000/08/01            031       1
3:   amio        maharrsmov 1939/03/08            031       2
4:   amik        maharramof 1939/03/08            031       2
5:   amil        maharramov 1993/03/08            031       2
6:  gadir        jahangirov 1991/08/29            031       3

We split the dataset into two separate files: one containing the first appearance of each entity in the foreigners dataset, and the other containing its subsequent appearances.

foreigners_1 <- foreigners[!duplicated(foreigners$true_id), ]
foreigners_2 <- foreigners[duplicated(foreigners$true_id), ]
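
The split relies on how duplicated marks repeated values. A small base R sketch, using the true_id values from the head() output above as a toy vector (not the full dataset), shows which rows end up in each file:

```r
# Toy stand-in for foreigners$true_id (first six values shown by head())
true_id <- c(0, 1, 2, 2, 2, 3)

# duplicated() is FALSE for the first appearance of a value, TRUE afterwards
first_seen <- !duplicated(true_id)

rows_file_1 <- which(first_seen)   # first appearance of each entity
rows_file_2 <- which(!first_seen)  # subsequent appearances

rows_file_1  # 1 2 3 6
rows_file_2  # 4 5
```

Every entity therefore appears exactly once in the first file, and only entities with repeated records appear in the second.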

Now in both datasets we remove slashes from the date column and create a new string column that concatenates the information from all columns (excluding true_id) in each row.

foreigners_1[, date := gsub("/", "", date)]
foreigners_1[, txt := paste0(fname, sname, surname, date, region, country)]
foreigners_2[, date := gsub("/", "", date)]
foreigners_2[, txt := paste0(fname, sname, surname, date, region, country)]
head(foreigners_1)
    fname  sname    surname     date region country true_id
   <char> <char>     <char>   <char> <char>  <char>   <num>
1:   emin            imanov 19980205            031       0
2: nurlan        suleymanli 20000801            031       1
3:   amio        maharrsmov 19390308            031       2
4:  gadir        jahangirov 19910829            031       3
5:   zaur         bayramova 19961006  01261     031       4
6:   asif          mammadov 19970726            031       5
                             txt
                          <char>
1:         eminimanov19980205031
2:   nurlansuleymanli20000801031
3:     amiomaharrsmov19390308031
4:    gadirjahangirov19910829031
5: zaurbayramova1996100601261031
6:       asifmammadov19970726031

We pass the newly created txt columns to the blocking function, which by default uses the Nearest Neighbor Descent algorithm from the rnndescent package with cosine distance. Additionally, we set verbose = 1 to monitor progress.

result_reclin <- blocking(x = foreigners_1$txt, y = foreigners_2$txt, verbose = 1)
===== creating tokens =====
===== starting search (nnd, x, y: 100000, 10000, t: 1232) =====
===== creating graph =====

Now we examine the results of record linkage.

result_reclin
========================================================
Blocking based on the nnd method.
Number of blocks: 6469.
Number of columns used for blocking: 1232.
Reduction ratio: 0.9999.
========================================================
Distribution of the size of the blocks:
   2    3    4    5    6    7 
3916 1604  926   19    2    2 
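
The reduction ratio can be read as the share of all possible comparisons that blocking avoids. The sketch below assumes the common definition, one minus the number of within-block candidate pairs divided by the number of all x-y pairs; the helper reduction_ratio and the toy data are hypothetical, and the package may count pairs differently.

```r
# Hypothetical helper: reduction ratio under the common definition
#   RR = 1 - (candidate pairs within blocks) / (all possible x-y pairs)
reduction_ratio <- function(pairs, n_x, n_y) {
  # Within a block, every distinct x is compared with every distinct y
  per_block <- tapply(seq_len(nrow(pairs)), pairs$block, function(i) {
    length(unique(pairs$x[i])) * length(unique(pairs$y[i]))
  })
  1 - sum(per_block) / (n_x * n_y)
}

# Toy block assignment: two blocks, three candidate pairs in total
toy <- data.frame(x = c(3, 3, 21), y = c(1, 2, 3), block = c(1, 1, 2))

# With 30 records in x and 5 in y there are 150 possible pairs
rr <- reduction_ratio(toy, n_x = 30, n_y = 5)
rr  # 0.98
```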

The structure of the object is as follows:

str(result_reclin, 1)
List of 8
 $ result        :Classes 'data.table' and 'data.frame':    10000 obs. of  4 variables:
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ method        : chr "nnd"
 $ deduplication : logi FALSE
 $ representation: chr "shingles"
 $ metrics       : NULL
 $ confusion     : NULL
 $ colnames      : chr [1:1232] "0a" "0b" "0c" "0m" ...
 $ graph         : NULL
 - attr(*, "class")= chr "blocking"

The resulting data.table has four columns: the record indices x and y, the block identifier, and the distance between the two records:

head(result_reclin$result)
       x     y block      dist
   <int> <int> <num>     <num>
1:     3     1     1 0.2216882
2:     3     2     1 0.2122737
3:    21     3     2 0.1172652
4:    57     4     3 0.1863238
5:    57     5     3 0.1379310
6:    61     6     4 0.2307692

Let’s examine the first pair. Obviously, there are typos in the fname and surname. Nevertheless, the pair appears to be a match.

cbind(t(foreigners_1[3, 1:6]), t(foreigners_2[1, 1:6]))
        [,1]         [,2]        
fname   "amio"       "amik"      
sname   ""           ""          
surname "maharrsmov" "maharramof"
date    "19390308"   "19390308"  
region  ""           ""          
country "031"        "031"       
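
The distance reported for this pair can be approximated by hand. The base R sketch below assumes two-character shingles weighted by their counts and cosine distance; this is an assumption suggested by the "shingles" representation and the two-character column names shown earlier, not a description of the package internals. The helpers shingles and cosine_dist are hypothetical.

```r
# Split a string into overlapping character n-grams (shingles)
shingles <- function(s, n = 2L) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  if (length(chars) < n) return(s)
  starts <- seq_len(length(chars) - n + 1L)
  vapply(starts, function(i) paste(chars[i:(i + n - 1L)], collapse = ""),
         character(1))
}

# Cosine distance between the shingle-count vectors of two strings
cosine_dist <- function(a, b, n = 2L) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  vocab <- union(sa, sb)
  va <- tabulate(match(sa, vocab), nbins = length(vocab))
  vb <- tabulate(match(sb, vocab), nbins = length(vocab))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# The txt values of the pair shown above (empty sname and region)
a <- "amiomaharrsmov19390308031"
b <- "amikmaharramof19390308031"

d <- cosine_dist(a, b)
d  # about 0.2217, in line with the dist of the first pair
```

Under these assumptions the two strings share most of their shingles, so the distance stays small despite the typos.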

Now we use the true_id values to evaluate our approach.

matches <- merge(x = foreigners_1[, .(x = 1:.N, true_id)],
                 y = foreigners_2[, .(y = 1:.N, true_id)],
                 by = "true_id")
matches[, block := rleid(x)]
head(matches)
Key: <true_id>
   true_id     x     y block
     <num> <int> <int> <int>
1:       2     3     1     1
2:       2     3     2     1
3:      20    21     3     2
4:      56    57     4     3
5:      56    57     5     3
6:      60    61     6     4
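
The block column is a run-length identifier: consecutive rows with the same x receive the same block number. A base R sketch of what rleid computes here, using the x values printed above as a toy vector:

```r
# x values from the first six rows of matches
x <- c(3L, 3L, 21L, 57L, 57L, 61L)

# A new run starts at the first element and wherever x differs from
# the previous element; the cumulative sum numbers the runs.
block <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
block  # 1 1 2 3 3 4
```

This works because the merged table is sorted by true_id, so all rows belonging to the same x are adjacent.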

We have 10,000 matched pairs. We use the true_blocks parameter in the blocking function to specify the true block assignments. We obtain the quality metrics for the assessment of record linkage.

result_2_reclin <- blocking(x = foreigners_1$txt, y = foreigners_2$txt, verbose = 1,
                            true_blocks = matches[, .(x, y, block)])
===== creating tokens =====
===== starting search (nnd, x, y: 100000, 10000, t: 1232) =====
===== creating graph =====
result_2_reclin
========================================================
Blocking based on the nnd method.
Number of blocks: 6469.
Number of columns used for blocking: 1232.
Reduction ratio: 0.9999.
========================================================
Distribution of the size of the blocks:
   2    3    4    5    6    7 
3916 1604  926   19    2    2 
========================================================
Evaluation metrics (standard):
     recall   precision         fpr         fnr    accuracy 
    96.7782     78.7000      0.0038      3.2218     99.9957 
specificity    f1_score 
    99.9962     86.8079 
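
The metrics are related to one another in the usual way. The sketch below assumes the standard confusion-matrix definitions over record pairs; the function blocking_metrics and the counts passed to it are hypothetical, and how the package counts pairs is not shown here.

```r
# Hypothetical helper: standard metrics from confusion-matrix counts
blocking_metrics <- function(tp, fp, fn, tn) {
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  c(recall      = recall,
    precision   = precision,
    fpr         = fp / (fp + tn),
    fnr         = fn / (fn + tp),
    accuracy    = (tp + tn) / (tp + fp + fn + tn),
    specificity = tn / (tn + fp),
    f1_score    = 2 * precision * recall / (precision + recall))
}

# Made-up counts, for illustration only
m <- blocking_metrics(tp = 90, fp = 10, fn = 30, tn = 870)
round(100 * m, 2)

# Consistency check on the reported (rounded) percentages:
# F1 is the harmonic mean of precision and recall
p <- 78.7000
r <- 96.7782
f1_reported <- 2 * p * r / (p + r)
f1_reported  # close to the reported f1_score of 86.8079
```

The same relations explain why fnr equals 100 minus recall and specificity equals 100 minus fpr in the output above.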

6 Customizing tooltip design with ToOoOlTiPs

ToOoOlTiPs is a package for customizing tooltips in interactive graphics; it offers the following possibilities.

The palmerpenguins data (Horst et al. 2020) features three penguin species, shown in a lovely illustration by Allison Horst in Figure 1.

A picture of three different penguins with their species: Chinstrap, Gentoo, and Adelie.

Figure 1: Artwork by @allison_horst

Table 1 prints the first few rows of the penguins data:

Table 1: A basic table
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen            39.1           18.7                181         3750  male    2007
Adelie   Torgersen            39.5           17.4                186         3800  female  2007
Adelie   Torgersen            40.3           18.0                195         3250  female  2007
Adelie   Torgersen              NA             NA                 NA           NA  NA      2007
Adelie   Torgersen            36.7           19.3                193         3450  female  2007
Adelie   Torgersen            39.3           20.6                190         3650  male    2007

Figure 2 shows an interactive plot of the penguins data, made using the plotly package.

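library(ggplot2)         # ggplot(), aes(), geom_point()
library(plotly)          # ggplotly(); also provides the %>% pipe
library(palmerpenguins)  # penguins data

p <- penguins %>% 
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm, 
             color = species)) + 
  geom_point()
ggplotly(p)  # convert the static ggplot into an interactive plot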

Figure 2: A basic interactive plot made with the plotly package on palmer penguin data. Three species of penguins are plotted with bill depth on the x-axis and bill length on the y-axis. When hovering on a point, a tooltip will show the exact value of the bill depth and length for that point, along with the species name.

8 Summary

We have displayed various tooltips that are available in the package ToOoOlTiPs.

9 Acknowledgements

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.

9.1 CRAN packages used

ToOoOlTiPs, plotly, crosstalk, tsibbletalk, rnndescent, igraph, palmerpenguins, ggplot2

9.2 CRAN Task Views implied by cited packages

ChemPhys, DynamicVisualizations, GraphicalModels, NetworkAnalysis, Optimization, Phylogenetics, Spatial, TeachingStatistics, TimeSeries, WebTechnologies

References

J. Cheng and C. Sievert. crosstalk: Inter-widget interactivity for HTML widgets. 2021. URL https://CRAN.R-project.org/package=crosstalk. R package version 1.1.1.
A. M. Horst, A. P. Hill and K. B. Gorman. palmerpenguins: Palmer Archipelago (Antarctica) penguin data. 2020. URL https://allisonhorst.github.io/palmerpenguins/. R package version 0.1.0.
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC, 2020. URL https://plotly-r.com.
E. Wang and D. Cook. Conversations in time: Interactive visualisation to explore structured temporal data. The R Journal, 2021. URL https://journal.r-project.org/archive/2021/RJ-2021-050/index.html.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Beręsewicz & Struzik, "Blocking: An R Package for Blocking of Records for Record Linkage and Deduplication", The R Journal, 2025

BibTeX citation

@article{paper-blocking,
  author = {Beręsewicz, Maciej and Struzik, Adam},
  title = {Blocking: An R Package for Blocking of Records for Record Linkage and Deduplication},
  journal = {The R Journal},
  year = {2025},
  issn = {2073-4859},
  pages = {1}
}